Experience with Applying Formal Methods to Protocol Specification and System Architecture
Authors
Abstract
In the last three years or so we at Enterprise Platforms Group at Intel Corporation have been applying formal methods to various problems that arose during the process of defining platform architectures for Intel’s processor families. In this paper we give an overview of some of the problems we have worked on, the results we have obtained, and the lessons we have learned. The last topic is addressed mainly from the perspective of platform architects. 1. Problems and Results Modern computer systems are highly complex distributed systems with many interacting components. Architecturally they are often organized like a computer network into multiple layers: physical layer, link layer, protocol layer, etc. Most of the problems to which we applied formal methods are the formal verification (FV) of intricate protocols in the protocol and link layers. In addition, we also found several novel uses of binary decision diagrams (BDDs) [3] that are worth mentioning. 1.1. Directory-based cache coherence protocols A significant portion of our work centers around the formal modeling and verification of directory-based cache coherence protocols in Scalability Port (SP), which is a family of scalable distributed shared memory architectures based on high-speed point-to-point interconnect technologies and whose first incarnation was implemented in Intel’s 870 chipset [2]. Directory-based cache coherence protocols are complex distributed algorithms that must operate correctly in the face of asynchrony, unpredictable message delays, and multiple paths between agents. Due to the astronomical number of possible executions, neither unaided human reasoning nor traditional simulation-based validation can be relied upon to flush out all bugs. Only exhaustive state space exploration offered by FV techniques can produce a high degree of confidence in the correctness of a protocol specification. ∗ To whom correspondences should be addressed: [email protected] c © 2003 Kluwer Academic Publishers. Printed in the Netherlands. intel-mpa.tex; 20/01/2003; 11:40; p.1 2 Mani Azimi, el al. Protocol specification Extracted p-tables Boolean rules Generated p-tables Non-tablebased code FV model Reference model Model checking Simulation = Find “easy” bugs in protocol Find “hard” bugs in protocol Find bugs in implementations Figure 1. Verification flow of SP cache coherence protocols Our verification flow is shown in Fig.1. The core logic of the cache coherence protocol is specified by a set of protocol tables, or p-tables for short. There are two reasons for adopting a tabular representation. First, tables are more structured and precise than texts or pictures and are relatively easy to read by both humans and machines. Second, “easy” errors in tables can be caught by “light-weight FV” based on Boolean rules, which will be explained in Sec.1.3.1. After the p-tables are checked by Boolean rules, they are translated by a simple compiler into code fragments that are included into an FV model. The translation guarantees that the p-tables and the FV model assume the same granularity of atomic actions. The FV model is then augmented with assertions expressing desired properties and exhaustively analyzed by symbolic model checking [8]. Finally, the FV model is translated by another, rather nontrivial, compiler into an executable reference model in C for use in simulation. 
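As a toy illustration of the tabular style (the field names, message names, and actions below are invented for this sketch and are not the actual SP p-table format), each row can be read as a guarded transition, and the same rows can drive both a model used for checking and a protocol-rule checker of the kind described next:

```python
# Hypothetical, simplified "p-table": each row maps (current cache state,
# incoming request) to (response to send, next cache state).  Field names
# and values are illustrative only, not the real SP protocol tables.
P_TABLE = [
    # cache_state, request,    response,     next_state
    ("I",          "SnpData",  "RspI",       "I"),
    ("S",          "SnpData",  "RspS",       "S"),
    ("E",          "SnpData",  "RspFwdS",    "S"),
    ("M",          "SnpData",  "RspFwdSWb",  "S"),
    ("M",          "SnpInv",   "RspIWb",     "I"),
]

def lookup(cache_state, request):
    """'Compile' the table on the fly: find the unique row whose guard
    (cache_state, request) matches, and return the prescribed action."""
    rows = [r for r in P_TABLE if r[0] == cache_state and r[1] == request]
    assert len(rows) == 1, f"missing or ambiguous row for {cache_state}/{request}"
    _, _, response, next_state = rows[0]
    return response, next_state

def check_step(cache_state, request, observed_response, observed_next):
    """Flag a simulated step that the table does not allow."""
    return (observed_response, observed_next) == lookup(cache_state, request)

# Example: a step that violates the table is flagged.
assert check_step("M", "SnpInv", "RspIWb", "I")
assert not check_step("M", "SnpInv", "RspI", "I")   # lost the writeback
```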
For example, the reference model can be used as a checker of protocol rules that flags an error whenever a simulated microarchitectural or RTL model takes a step that is not allowed by the protocol specification.

There are two types of properties that we have found useful. The first is the standard safety properties: whenever the state of a cache is not Invalid, then its data content is up-to-date; no two caches can be in Modified or Exclusive state at the same time; and so on. But, while they are useful for checking that certain "bad things" can never happen, safety properties don't guarantee that any "good things" will ever happen. So we also assert a set of weak liveness properties of the form AG EF (cs = CS), which states that, from every reachable state, there exists an execution placing the "control state" cs of a protocol structure (e.g., the MESI state of a cache) in any one of its possible values CS (e.g., any of the MESI states). By systematically asserting weak liveness properties for all protocol structures and all their control states, we guard against deadlocks, missing cases in protocol tables, and other unexpected protocol scenarios.

Since the complexity of symbolic model checking is very sensitive to the number of state bits in a model, we use the technique of refinement mappings [9] to reduce that number. There are two main types of refinement mappings that we have found useful. First, during its lifetime, a transaction in a cache coherence protocol travels through multiple protocol structures, each of which has a "transaction type" field to track the type of the transaction currently occupying it. Most of these fields are redundant in the sense that their values can be derived from that of the one in the originating structure, and hence they can be eliminated by refinement mappings. Second, we introduce an auxiliary variable to track the most up-to-date data value of any address and express each state variable in the "data path" of the model in terms of the auxiliary variable and of when the state variable is "valid". Note, however, that this type of refinement mapping would be less effective if the cache coherence protocol allowed multiple "up-to-date" values for an address at the same time. Fortunately, SP is not like that.

Results: It is now routine for us to make changes to the protocol tables, update the Boolean rules to reflect the changes, check the agreement between tables and rules to flush out "easy" bugs, generate an FV model from the tables, apply model checking to the FV model to detect "hard" bugs, trace the bugs back to and fix them in the tables, and re-iterate the cycle. After the FV model is formally verified, a new executable reference model is generated at the click of a button. Generally speaking, complete (in the sense of covering all transaction types, of which there are more than 20) models with 1 home node and 2 caching nodes can be model-checked without any special effort. Our intuition is that, due to the nature of the protocol, this configuration covers "essentially all" scenarios. We are currently in the process of formalizing this intuition. Not surprisingly, most bugs we found were introduced by major protocol changes (e.g., new transaction types, changes to conflict resolution mechanisms, etc.). However, even relatively minor modifications could sometimes have consequences unforeseen by people who knew the protocol very well. All these confirm our earlier statement that, as far as directory-based cache coherence protocols are concerned, unaided human reasoning is not to be trusted.
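The weak liveness properties above have a simple explicit-state reading: from every reachable state, every control value of interest must still be reachable. The sketch below checks this on a hypothetical toy transition graph (the real work used symbolic model checking on much larger models):

```python
from collections import deque

def reachable(start, succ):
    """Breadth-first set of states reachable from `start` under `succ`."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for t in succ(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def check_ag_ef(init, succ, control_of, control_values):
    """Explicit-state analogue of AG EF (cs = v): from every reachable
    state, some state with each control value v is still reachable."""
    for s in reachable(init, succ):
        values_from_s = {control_of(t) for t in reachable(s, succ)}
        missing = set(control_values) - values_from_s
        if missing:
            return False, (s, missing)   # counterexample state, unreachable values
    return True, None

# Toy example: a MESI-like control state that can always return to any value.
states = {"I": ["S", "E"], "S": ["I", "M"], "E": ["M", "I"], "M": ["I"]}
ok, cex = check_ag_ef("I", lambda s: states[s], lambda s: s, ["M", "E", "S", "I"])
print(ok)   # True: every value stays reachable from every reachable state
```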
1.2. Variations of sliding window protocols

For electrical and physical design reasons, more and more high-speed interconnects in future computer systems will shift from shared-bus to point-to-point technologies. This is true both for processor interconnects (e.g., the simultaneous bidirectional signaling technology used in 870 [6]) and for I/O subsystems (e.g., PCI Express [1], which is an upcoming replacement of the PCI bus). In the link layer of a point-to-point system, a variety of sliding window protocols (SWPs) perform error recovery, flow control, and other functions. But, due to various design constraints, SWPs in real systems often deviate from "textbook" SWPs in subtle ways. For example, to reduce pin-count overhead in 870, link-level error recovery is implemented using a SWP in which acknowledgements are signaled by tokens that are counted, rather than by sequence numbers. Our experience shows that such seemingly innocuous modifications can in fact be extremely tricky and hence call for formal verification.

Results: We have formally verified the link-level error recovery and flow control protocols in both 870 and PCI Express and have found very subtle problems in some of these protocols. Our work led to both changes in RTL design (for 870) and clarifications of specifications (for PCI Express).

1.3. Novel applications of binary decision diagrams

Although BDDs are the real workhorse underlying such widely used FV technologies as symbolic model checking [8] and symbolic trajectory evaluation [11], the users of FV tools seldom manipulate BDDs directly. Yet we have found that BDDs can be used to attack some problems that are not traditionally considered in the FV literature. In addition to BDDs, another prerequisite for the work described below is a good scripting language (e.g., FL in the FORTE environment [7]) that supports easy manipulation of BDDs.

1.3.1. Rule-based checking of tables

As mentioned in Sec. 1.1, we use "light-weight FV" based on Boolean rules to flush out "easy" bugs in long and complex tables in specification documents. Our methodology is based on the observation that the contents of a specification table are not random, but typically exhibit a great deal of regularity. It is possible to capture those regularities as Boolean rules (or, simply, rules) that express the expected relationships among the entries in any row of the table. Our experience shows that, in fact, it is not hard to articulate sufficiently many rules to completely characterize a table, in the sense that the set of rows satisfying the constraints imposed by the rules is exactly the same as the set of rows in the manually constructed table in the specification document. Thus the correctness and completeness of the latter can be checked by comparing it against the former. Any disagreement between the two indicates an error in one or both.
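As a toy illustration of this comparison (the table, its fields, and the rules below are invented for the example, not taken from any SP specification), the set of rows implied by the rules can be enumerated and diffed against the hand-written table:

```python
from itertools import product

# A small hand-written table: (request, cache_state, action).  Contents are
# illustrative only; real specification tables are far larger.
MANUAL_TABLE = {
    ("Read",  "I", "Fetch"),
    ("Read",  "S", "Hit"),
    ("Read",  "M", "Hit"),
    ("Write", "I", "FetchOwn"),
    ("Write", "S", "Upgrade"),
    ("Write", "M", "Hit"),
}

REQUESTS = ["Read", "Write"]
STATES = ["I", "S", "M"]
ACTIONS = ["Fetch", "FetchOwn", "Upgrade", "Hit"]

# Boolean rules: expected relationships among the entries of any row.
RULES = [
    lambda req, st, act: (st == "I") == (act in {"Fetch", "FetchOwn"}),
    lambda req, st, act: (act == "FetchOwn") == (req == "Write" and st == "I"),
    lambda req, st, act: (act == "Upgrade") == (req == "Write" and st == "S"),
    lambda req, st, act: (act == "Hit") == (st == "M" or (req == "Read" and st == "S")),
    lambda req, st, act: (act == "Fetch") == (req == "Read" and st == "I"),
]

# Rows implicitly defined by the rules: every field combination satisfying all rules.
rule_table = {
    row for row in product(REQUESTS, STATES, ACTIONS)
    if all(rule(*row) for rule in RULES)
}

# Any disagreement indicates an error in the table, in the rules, or in both.
print("in rules but not in table:", rule_table - MANUAL_TABLE)
print("in table but not in rules:", MANUAL_TABLE - rule_table)
```

For this toy example both differences come out empty; in practice it is exactly the non-empty differences that point at bugs.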
The reader may ask: why not dispense with manual table construction altogether and generate tables directly from rules? The reason is that enumerating a table's rows explicitly and coding rules that implicitly define the rows are two fundamentally different but complementary activities. When one starts to construct a new table, it is far easier to capture one's intention by enumerating rows than by coding rules, because one has no clear idea about the general shape of the table at that point. As the table matures and grows in size, inspection becomes more tedious and less fruitful in finding bugs. Then it is time to switch to the complementary point of view, observe regularities in the table, and code rules to capture them. The table and the rules are checked against each other and eventually converge to the same set of rows. Ideally, the table and the rules should be coded by two different persons, in order to maximize the difference in points of view. Later, when the table undergoes modification (as specifications inevitably do over time), one does not rely on visual inspection to ensure that the internal consistency of the table is maintained. Rather, the consistency conditions have been captured by the rules, which can now be automatically compared against the new table to catch consistency violations.

Results: When rule-based table checking is applied to a set of tables for the first time, dozens of problems are typically found. Most of these problems are trivial, such as misspelled, misplaced, or otherwise incorrect entries. But some of the problems are more serious, such as missing cases and systematic mistakes across multiple tables. After several iterations, the tables and rules quickly converge to agree with each other. From then on the maintenance of tables and rules does not require much work, unless the change is major. In particular, we found that changing the rules to keep up with the tables rarely requires more work than changing the tables themselves.

1.3.2. Search for minimal deadlock-free wormhole routing schemes

There is a large body of research literature on deadlock-free wormhole routing schemes (e.g., see [5, 10]). However, when given a specific interconnect topology, general deadlock-free routing techniques found in the literature (e.g., dimension routing [5]) sometimes do not yield a minimal routing scheme (i.e., one in which every path used by the scheme is a shortest path). Sometimes it may not even be obvious whether a minimal deadlock-free routing scheme exists at all. If the topology is not too large, BDD-based techniques can be used to exhaustively search for such schemes, as described below.

By definition, a minimal routing scheme uses only shortest paths between pairs of nodes in a topology. But there are only a finite number of shortest paths between any pair of nodes. Thus one can introduce sufficiently many Boolean variables, called selector variables, to index the set of shortest paths between all pairs of nodes, so that each truth assignment to the selector variables corresponds to a minimal routing scheme. A routing scheme induces a channel dependency graph, as follows. The vertices of the graph are the virtual channels available in the topology, which in the simplest case can be identified with directed links in the topology (i.e., only one virtual channel per direction per link). The edges of the graph represent the dependency relation between channels imposed by the paths used by the routing scheme: there is an edge from channel c to channel c′ whenever the routing scheme uses a path that goes through c and then c′. It is not hard to see that the channel dependency graph can be represented by a BDD, among whose variables are the selector variables mentioned earlier. (More precisely, the BDD represents a family of channel dependency graphs, one per truth assignment to the selector variables, each of which chooses a minimal routing scheme.)
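As a concrete, non-symbolic analogue of this search (using a deliberately tiny hypothetical topology, a bidirectional 4-node ring, and explicit enumeration in place of BDDs), one can pick a value for each selector, build the induced channel dependency graph, and test it for cycles with a transitive-closure fixpoint, which is exactly the criterion discussed next:

```python
from itertools import product

# Tiny hypothetical topology: a bidirectional 4-node ring.  Channels are the
# directed links; antipodal node pairs have two alternative shortest paths,
# selected by one Boolean "selector" per pair.
N = 4

def path_cw(src, dst):          # clockwise node sequence src -> dst
    seq = [src]
    while seq[-1] != dst:
        seq.append((seq[-1] + 1) % N)
    return seq

def path_ccw(src, dst):         # counter-clockwise node sequence src -> dst
    seq = [src]
    while seq[-1] != dst:
        seq.append((seq[-1] - 1) % N)
    return seq

pairs = [(s, d) for s in range(N) for d in range(N) if s != d]
choice_pairs = [(s, d) for (s, d) in pairs if (d - s) % N == 2]   # two shortest paths
forced = {(s, d): path_cw(s, d) if (d - s) % N == 1 else path_ccw(s, d)
          for (s, d) in pairs if (s, d) not in choice_pairs}

def dependency_edges(assignment):
    """Channel dependency edges induced by one minimal routing scheme."""
    routes = dict(forced)
    for (s, d), bit in zip(choice_pairs, assignment):
        routes[(s, d)] = path_cw(s, d) if bit else path_ccw(s, d)
    edges = set()
    for seq in routes.values():
        chans = list(zip(seq, seq[1:]))          # channels = directed links on the path
        edges.update(zip(chans, chans[1:]))      # edge: c used, then c' used
    return edges

def has_cycle(edges):
    """Acyclicity test via transitive-closure fixpoint (self-loop check)."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            break
        closure |= new
    return any(a == b for (a, b) in closure)

deadlock_free = [bits for bits in product([0, 1], repeat=len(choice_pairs))
                 if not has_cycle(dependency_edges(bits))]
print(len(deadlock_free), "of", 2 ** len(choice_pairs), "minimal schemes are deadlock-free")
```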
It is shown in [5] that a routing scheme is deadlock-free if and only if the channel dependency graph it induces is acyclic. A graph is acyclic if and only if its transitive closure contains no self-loop. The transitive closure of a graph represented by a BDD can be computed using fixpoint iteration. Conclusions: (1) the problem of whether a topology permits a minimal deadlock-free routing scheme can be reduced to that of whether there is a truth assignment to the selector variables such that the transitive closure of the resulting channel dependency graph has no self-loop; (2) the latter problem is solvable by BDD manipulation (at least when the topology is not too large).

Results: We have applied the methodology outlined above to several topologies that came up during the interconnect architecture exploration process. In one of them, there are 64 pairs of nodes between each of which two alternative shortest paths exist (all other pairs of nodes have unique shortest paths between them), giving rise to 64 selector variables. After several hours of BDD computation, we found that none of the 2^64 possible minimal routing schemes is deadlock-free. It is hard to imagine how such a result could be obtained without BDDs.

Acknowledgements: We would like to thank Jay Jayasimha and Aniruddha Vaidya for valuable help on this problem.

1.3.3. Search for fault-tolerant link initialization sequences

During link initialization the two state machines at the two ends of a link must coordinate their transitions by exchanging delimiters. The precise bit patterns denoting these delimiters must be fault-tolerant in the sense that a small number of misrecognized bits cannot lead one delimiter to be confused with another delimiter, with the same delimiter but in a wrong time frame, or with the bits that are sent between delimiters. (Since one of the purposes of link initialization is to "train" the receiver circuitry, misrecognition of bits is more likely during initialization than after it.) This delimiter selection problem has several parameters. Each of the d delimiters is D bits long. The detection process aims to tolerate any e-bit errors in any E-bit windows. The receiver may use a C-bit comparator (C ≥ D) to detect the delimiters in the received bit stream. (Because the link operates at very high speeds, only simple comparator-based designs can be used.)

For example, suppose that we have two 4-bit delimiters "0000" and "1111". When they are transmitted within alternating bit streams, either ". . . 010100000101 . . ." or ". . . 010111110101 . . ." is received, where the delimiter occupies the middle four bits of each fragment. Any single-bit error in an 8-bit window cannot cause a wrong detection when the receiver uses a 4-bit comparator. However, the same two delimiters cannot tolerate 2-bit errors in 8-bit windows using 4-bit comparators: the first bit stream above may be corrupted to ". . . 010101000001 . . .", in which "0000" now appears later than in the uncorrupted stream and leads to the "detection" of the right delimiter in the wrong time frame, or to ". . . 011110000101 . . .", in which "1111" appears and leads to the "detection" of the wrong delimiter.
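A brute-force sketch in the spirit of this example is shown below. It is a simplification: it flips up to e bits anywhere in the received fragment rather than per E-bit window, and it merely lists raw comparator matches; the receiver's framing logic, which decides which matches actually count as mis-detections, is not modeled. The fragment and the "expected" match at offset 4 are taken from the example above.

```python
from itertools import combinations

def flips(bits, e):
    """All corruptions of the bit string `bits` with at most `e` flipped bits."""
    n = len(bits)
    for k in range(e + 1):
        for idxs in combinations(range(n), k):
            yield "".join(
                ("1" if b == "0" else "0") if i in idxs else b
                for i, b in enumerate(bits)
            )

def matches(stream, delimiters, C):
    """Offsets at which a C-bit comparator sees one of the delimiters."""
    return [(off, d) for off in range(len(stream) - C + 1)
            for d in delimiters if stream[off:off + C] == d]

# The 4-bit delimiters and one of the alternating fragments from the example;
# the true delimiter occupies offsets 4..7.
delims, stream, C = ["0000", "1111"], "010100000101", 4
examples = []
for corrupted in flips(stream, 2):
    hits = matches(corrupted, delims, C)
    wrong = [(off, d) for off, d in hits if (off, d) != (4, "0000")]
    if wrong:
        examples.append((corrupted, wrong))

print(len(examples), "corruptions cause a spurious comparator match; e.g.:")
for corrupted, wrong in examples[:3]:
    print(" ", corrupted, "->", wrong)
```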
Results: For historical reasons, we wrote a C program that, given a set of parameters, exhaustively enumerates all possible sets of delimiters and tests each of them against all possible error patterns. We are currently in the process of re-coding it so that we can perform the same task symbolically using BDDs, since exhaustive enumeration in C is running out of steam due to increasing parameter values. We include the problem here because its spirit is very similar to the problem in Sec. 1.3.2 and our approach to it is clearly influenced by "formal methods thinking".
Similar resources
Web Service Choreography Verification Using Z Formal Specification
Web Service Choreography Description Language (WS-CDL) describes and orchestrates the services interactions among multiple participants. WS-CDL verification is essential since the interactions would lead to mismatches. Existing works verify the messages ordering, the flow of messages, and the expected results from collaborations. In this paper, we present a Z specification of WS-CDL. Besides ve...
Formal Specification and Analysis of Active Networks and Communication Protocols: The Maude Experience
Rewriting logic and the Maude language make possible a new methodology in which formal modeling and analysis can be used from the earliest phases of system design to uncover many errors and inconsistencies, and to reach high assurance for critical components. Our methodology is arranged as a sequence of increasingly stronger methods, including formal modeling, executable specification, modelche...
Architecture: Requirements + Decomposition + Refinement
This paper focuses on the system requirements and architecture w.r.t. their decomposition and refinement: how the refinement-based verification can be used to optimize verification process, and which influences it has on the specification process. We introduce here specification decomposition methods, applying which ones can not only to keep the specification readable and manageable, but also f...
Practical Experience Applying Formal Methods to Air Traffic Management Software
This paper relates experiences with formal methods that are relevant to the systems engineering activities of requirements specification, design documentation, and test case generation. Specifically, this paper reviews the lessons learned from the application of formal methods to selected components of an air traffic management system. This project used experimental tools developed at the Unive...
A Case Study on the Use of SDL
This paper presents the experience the authors gained in applying formal methods — mainly MSC and SDL — when specifying a reactive system. The experience not only deals with the descriptions of the system, but also with the methodology used to develop the descriptions.
Tools for the Governance of Urban Design: The Tehran Experience
This research seeks to reflect the managerial, academic and professional experience of the authors in the design and implementation process of urban design projects, aiming to use the application of the “design governance” model, in order to describe the documents and activities of the Department of Urban Planning and Architecture of Tehran Municipality in the last decade. This paper consists ...
Journal title:
Formal Methods in System Design
Volume 22, Issue -
Pages -
Publication date: 2003